This report analyzes storm data to answer two questions:
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
To complete this analysis, we used three packages.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(ggplot2)
This report has four parts: Synopsis, Data Processing, Results, and Summary.
This report analyzes the damage caused by severe weather events in the United States. We used the storm database of the U.S. National Oceanic and Atmospheric Administration (NOAA). Our goals are to find (1) which types of events are most harmful with respect to population health, and (2) which types of events have the greatest economic consequences.
Our analysis finds that tornadoes are the most harmful events for population health, causing 5633 fatalities and 91346 injuries. Floods, on the other hand, have the greatest economic consequences.
First, we read the data from the original bz2 file into a dataset named “rawStorm”.
rawStorm <- tbl_df(read_csv("repdata_data_StormData.csv.bz2"))
## Parsed with column specification:
## cols(
## .default = col_double(),
## BGN_DATE = col_character(),
## BGN_TIME = col_character(),
## TIME_ZONE = col_character(),
## COUNTYNAME = col_character(),
## STATE = col_character(),
## EVTYPE = col_character(),
## BGN_AZI = col_logical(),
## BGN_LOCATI = col_logical(),
## END_DATE = col_logical(),
## END_TIME = col_logical(),
## COUNTYENDN = col_logical(),
## END_AZI = col_logical(),
## END_LOCATI = col_logical(),
## PROPDMGEXP = col_character(),
## CROPDMGEXP = col_logical(),
## WFO = col_logical(),
## STATEOFFIC = col_logical(),
## ZONENAMES = col_logical(),
## REMARKS = col_logical()
## )
## See spec(...) for full column specifications.
## Warning: 5255570 parsing failures.
## row col expected actual file
## 1671 WFO 1/0/T/F/TRUE/FALSE NG 'repdata_data_StormData.csv.bz2'
## 1673 WFO 1/0/T/F/TRUE/FALSE NG 'repdata_data_StormData.csv.bz2'
## 1674 WFO 1/0/T/F/TRUE/FALSE NG 'repdata_data_StormData.csv.bz2'
## 1675 WFO 1/0/T/F/TRUE/FALSE NG 'repdata_data_StormData.csv.bz2'
## 1678 WFO 1/0/T/F/TRUE/FALSE NG 'repdata_data_StormData.csv.bz2'
## .... ... .................. ...... ................................
## See problems(...) for more details.
str(rawStorm)
## Classes 'tbl_df', 'tbl' and 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : logi NA NA NA NA NA NA ...
## $ BGN_LOCATI: logi NA NA NA NA NA NA ...
## $ END_DATE : logi NA NA NA NA NA NA ...
## $ END_TIME : logi NA NA NA NA NA NA ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : logi NA NA NA NA NA NA ...
## $ END_LOCATI: logi NA NA NA NA NA NA ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : num 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: logi NA NA NA NA NA NA ...
## $ WFO : logi NA NA NA NA NA NA ...
## $ STATEOFFIC: logi NA NA NA NA NA NA ...
## $ ZONENAMES : logi NA NA NA NA NA NA ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : logi NA NA NA NA NA NA ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
## - attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 5255570 obs. of 5 variables:
## ..$ row : int 1671 1673 1674 1675 1678 1679 1680 1681 1682 1683 ...
## ..$ col : chr "WFO" "WFO" "WFO" "WFO" ...
## ..$ expected: chr "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" ...
## ..$ actual : chr "NG" "NG" "NG" "NG" ...
## ..$ file : chr "'repdata_data_StormData.csv.bz2'" "'repdata_data_StormData.csv.bz2'" "'repdata_data_StormData.csv.bz2'" "'repdata_data_StormData.csv.bz2'" ...
## - attr(*, "spec")=
## .. cols(
## .. STATE__ = col_double(),
## .. BGN_DATE = col_character(),
## .. BGN_TIME = col_character(),
## .. TIME_ZONE = col_character(),
## .. COUNTY = col_double(),
## .. COUNTYNAME = col_character(),
## .. STATE = col_character(),
## .. EVTYPE = col_character(),
## .. BGN_RANGE = col_double(),
## .. BGN_AZI = col_logical(),
## .. BGN_LOCATI = col_logical(),
## .. END_DATE = col_logical(),
## .. END_TIME = col_logical(),
## .. COUNTY_END = col_double(),
## .. COUNTYENDN = col_logical(),
## .. END_RANGE = col_double(),
## .. END_AZI = col_logical(),
## .. END_LOCATI = col_logical(),
## .. LENGTH = col_double(),
## .. WIDTH = col_double(),
## .. F = col_double(),
## .. MAG = col_double(),
## .. FATALITIES = col_double(),
## .. INJURIES = col_double(),
## .. PROPDMG = col_double(),
## .. PROPDMGEXP = col_character(),
## .. CROPDMG = col_double(),
## .. CROPDMGEXP = col_logical(),
## .. WFO = col_logical(),
## .. STATEOFFIC = col_logical(),
## .. ZONENAMES = col_logical(),
## .. LATITUDE = col_double(),
## .. LONGITUDE = col_double(),
## .. LATITUDE_E = col_double(),
## .. LONGITUDE_ = col_double(),
## .. REMARKS = col_logical(),
## .. REFNUM = col_double()
## .. )
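The parsing failures and the `col_logical()` guesses in the output above (for example for `WFO` and `CROPDMGEXP`) are what readr tends to produce when a column is blank in the first rows it inspects and only later holds text. A column guessed as logical cannot hold letter codes such as "K" or "M", so those values become missing. A minimal sketch of one way to avoid this, assuming the column types are declared up front, is shown below on a tiny stand-in file; `demo_path` and `demo` are hypothetical names, and the three data lines are made up for illustration.

```r
library(readr)

# Tiny stand-in file (made-up rows): the exponent and WFO columns are blank in
# the first record and only later hold text, as in the early storm records.
demo_path <- tempfile(fileext = ".csv")
writeLines(c(
  "EVTYPE,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP,WFO",
  "TORNADO,25,K,0,,",
  "FLOOD,5,M,10,K,NG"
), demo_path)

# Declare the column types instead of letting readr guess them.
demo <- read_csv(
  demo_path,
  col_types = cols(
    EVTYPE     = col_character(),
    PROPDMG    = col_double(),
    PROPDMGEXP = col_character(),
    CROPDMG    = col_double(),
    CROPDMGEXP = col_character(),
    WFO        = col_character()
  )
)

str(demo)
```

The same `col_types` argument could presumably be passed when reading "repdata_data_StormData.csv.bz2", with the remaining text columns declared in the same way.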
The rawStorm data contains 37 variables and 902297 observations. Because we want to find which events are most harmful, we keep only the observations that record an event type.
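# Drop rows whose event type is missing, empty, or a single space.
rawStorm <- filter(rawStorm, EVTYPE != "", EVTYPE != " ", !is.na(EVTYPE))
# Grouping by EVTYPE is applied inside each summary pipeline below.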
Because we have two different questions to analyze, we split the data into two data sets that contain different variables.
To find the events most harmful with respect to population health, we analyze fatalities and injuries. We sum the total fatalities and total injuries for each event type and show the top 3 events.
# Total fatalities and injuries per event type, ranked by fatalities, then injuries.
harmful_with_health <- rawStorm %>%
  select(EVTYPE, FATALITIES, INJURIES) %>%
  group_by(EVTYPE) %>%
  summarize(total_fatalities = sum(FATALITIES), total_injuries = sum(INJURIES)) %>%
  arrange(desc(total_fatalities), desc(total_injuries)) %>%
  head(3)
harmful_with_health
To compare total fatalities and total injuries across these events, we use a bar plot.
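A minimal sketch of one way to draw such a bar plot with ggplot2 follows; it is not necessarily the code behind the original figure. The table `health_demo` is a hypothetical stand-in for `harmful_with_health`: its TORNADO row uses the totals quoted in this report, while EVENT_B and EVENT_C are made-up placeholders.

```r
library(dplyr)
library(ggplot2)

# Stand-in for harmful_with_health. Only the TORNADO totals come from this
# report; EVENT_B and EVENT_C are placeholders invented for the sketch.
health_demo <- tibble(
  EVTYPE           = c("TORNADO", "EVENT_B", "EVENT_C"),
  total_fatalities = c(5633, 1000, 500),
  total_injuries   = c(91346, 5000, 2000)
)

# Stack the two measures into one long table so the bars can sit side by side.
health_long <- bind_rows(
  transmute(health_demo, EVTYPE, measure = "Fatalities", count = total_fatalities),
  transmute(health_demo, EVTYPE, measure = "Injuries", count = total_injuries)
)

health_plot <- ggplot(health_long,
                      aes(x = reorder(EVTYPE, -count), y = count, fill = measure)) +
  geom_col(position = "dodge") +
  labs(x = "Event type", y = "Number of people", fill = NULL,
       title = "Fatalities and injuries by event type")

print(health_plot)
```

Because injuries can greatly outnumber fatalities, a log-scaled y axis or separate panels per measure may make the smaller bars easier to read.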
To find the greatest economic consequences, we analyze property damage and crop damage. The variables “PROPDMGEXP” and “CROPDMGEXP” use letters to represent multipliers (H for hundreds, K for thousands, M for millions, B for billions), so we convert them into numbers to put all damage values in the same unit. We then rank the events by total property damage, then by total crop damage, and show the top 5.
cost_of_economic <- rawStorm %>%
  select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
  mutate(
    # Map exponent letters to multipliers; any other symbol counts as 1.
    prop_exp = ifelse(PROPDMGEXP == "B" | PROPDMGEXP == "b", 1000000000,
               ifelse(PROPDMGEXP == "M" | PROPDMGEXP == "m", 1000000,
               ifelse(PROPDMGEXP == "K" | PROPDMGEXP == "k", 1000,
               ifelse(PROPDMGEXP == "H" | PROPDMGEXP == "h", 100,
                      1)))),
    crop_exp = ifelse(CROPDMGEXP == "B" | CROPDMGEXP == "b", 1000000000,
               ifelse(CROPDMGEXP == "M" | CROPDMGEXP == "m", 1000000,
               ifelse(CROPDMGEXP == "K" | CROPDMGEXP == "k", 1000,
               ifelse(CROPDMGEXP == "H" | CROPDMGEXP == "h", 100,
                      1)))),
    # Damage values with the multiplier applied.
    prop_dm = PROPDMG * prop_exp,
    crop_dm = CROPDMG * crop_exp
  ) %>%
  group_by(EVTYPE) %>%
  summarize(total_prop_dm = sum(prop_dm, na.rm = TRUE),
            total_crop_dm = sum(crop_dm, na.rm = TRUE)) %>%
  arrange(desc(total_prop_dm), desc(total_crop_dm)) %>%
  head(5)
cost_of_economic
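The exponent conversion can also be written as a small lookup table, which may be easier to read and check than nested `ifelse()` calls. The sketch below assumes the same mapping as the code above (H, K, M, B in either case, any other symbol treated as 1, missing codes left missing); `exp_to_multiplier` is a hypothetical helper name, not part of any package. For example, a PROPDMG of 25 with exponent "K" corresponds to 25 × 1000 = 25000.

```r
# Hypothetical helper: convert damage exponent codes to numeric multipliers.
exp_to_multiplier <- function(exp_code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  # Look codes up case-insensitively; unknown codes come back as NA here.
  mult <- unname(lookup[toupper(as.character(exp_code))])
  # Codes that are present but unrecognized count as a multiplier of 1,
  # while codes that are missing in the data stay NA.
  mult[is.na(mult) & !is.na(exp_code)] <- 1
  mult
}

# Example codes, including lower case, an unrecognized symbol, and a missing value.
exp_to_multiplier(c("K", "m", "B", "h", "5", NA))

# A recorded damage of 25 with exponent "K".
25 * exp_to_multiplier("K")
```

Inside the pipeline, the helper could replace each nested `ifelse()` chain, for instance `prop_exp = exp_to_multiplier(PROPDMGEXP)`.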
This report shows that tornadoes are the most harmful events for population health, causing 5633 fatalities and 91346 injuries; at the same time, they also cause a lot of property damage. Floods are another serious disaster: they caused the greatest economic damage, along with many casualties.